{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping and Pivoting\n", "
\n", "\n", "In this section, we will answer the question:\n", "\n", "**What were the most popular male and female names in each year?**\n", "\n", "Here's the Baby Names dataset once again:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYear
0MaryF92171884
1AnnaF38601884
2EmmaF25871884
3ElizabethF25491884
4MinnieF22431884
\n", "
" ], "text/plain": [ " Name Sex Count Year\n", "0 Mary F 9217 1884\n", "1 Anna F 3860 1884\n", "2 Emma F 2587 1884\n", "3 Elizabeth F 2549 1884\n", "4 Minnie F 2243 1884" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "baby = pd.read_csv('babynames.csv')\n", "baby.head()\n", "# the .head() method outputs the first five rows of the DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Breaking the Problem Down**\n", "\n", "We decompose this problem into simpler table manipulations.\n", "\n", "1. Group the `baby` DataFrame by 'Year' and 'Sex'.\n", "2. For each group, compute the most popular name.\n", "\n", "Recognizing which operation is needed for each problem is sometimes tricky. Usually, a convoluted series of steps will signal to you that there might be a simpler way to express what you want. If we didn't immediately recognize that we needed to group, for example, we might write steps like the following:\n", "\n", "1. Loop through each unique year.\n", "2. For each year, loop through each unique sex.\n", "3. For each unique year and sex, find the most common name.\n", "\n", "```{note}\n", "There is almost always a better alternative to looping over a `pandas` DataFrame. In particular, looping over unique values of a DataFrame should usually be replaced with a group.\n", "```\n", "\n", "## Grouping\n", "
\n", "\n", "To group in `pandas`. we use the `.groupby()` method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby.groupby('Year')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`.groupby()` returns a strange-looking `DataFrameGroupBy` object. We can call `.agg()` on this object with an aggregation function in order to get a familiar output:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCount
Year
1880200020002000
1881193519351935
1882212721272127
1883208420842084
1884229722972297
............
2012337153371533715
2013332533325333253
2014332063320633206
2015330633306333063
2016328683286832868
\n", "

137 rows × 3 columns

\n", "
" ], "text/plain": [ " Name Sex Count\n", "Year \n", "1880 2000 2000 2000\n", "1881 1935 1935 1935\n", "1882 2127 2127 2127\n", "1883 2084 2084 2084\n", "1884 2297 2297 2297\n", "... ... ... ...\n", "2012 33715 33715 33715\n", "2013 33253 33253 33253\n", "2014 33206 33206 33206\n", "2015 33063 33063 33063\n", "2016 32868 32868 32868\n", "\n", "[137 rows x 3 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The aggregation function takes in a series of values for each group\n", "# and outputs a single value\n", "def length(series):\n", " return len(series)\n", "\n", "# Count up number of values for each year. This is equivalent to\n", "# counting the number of rows where each year appears.\n", "baby.groupby('Year').agg(length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might notice that the `length` function simply calls the `len` function, so we can simplify the code above." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCount
Year
1880200020002000
1881193519351935
1882212721272127
1883208420842084
1884229722972297
............
2012337153371533715
2013332533325333253
2014332063320633206
2015330633306333063
2016328683286832868
\n", "

137 rows × 3 columns

\n", "
" ], "text/plain": [ " Name Sex Count\n", "Year \n", "1880 2000 2000 2000\n", "1881 1935 1935 1935\n", "1882 2127 2127 2127\n", "1883 2084 2084 2084\n", "1884 2297 2297 2297\n", "... ... ... ...\n", "2012 33715 33715 33715\n", "2013 33253 33253 33253\n", "2014 33206 33206 33206\n", "2015 33063 33063 33063\n", "2016 32868 32868 32868\n", "\n", "[137 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby.groupby('Year').agg(len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A further shorthand to accomplish the same result would be by using .count() method as our aggregating function. Pandas has shorthands for common aggregation functions, including `count`,` sum`, and `mean`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCount
Year
1880200020002000
1881193519351935
1882212721272127
1883208420842084
1884229722972297
............
2012337153371533715
2013332533325333253
2014332063320633206
2015330633306333063
2016328683286832868
\n", "

137 rows × 3 columns

\n", "
" ], "text/plain": [ " Name Sex Count\n", "Year \n", "1880 2000 2000 2000\n", "1881 1935 1935 1935\n", "1882 2127 2127 2127\n", "1883 2084 2084 2084\n", "1884 2297 2297 2297\n", "... ... ... ...\n", "2012 33715 33715 33715\n", "2013 33253 33253 33253\n", "2014 33206 33206 33206\n", "2015 33063 33063 33063\n", "2016 32868 32868 32868\n", "\n", "[137 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby.groupby('Year').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The aggregation is applied to each column of the DataFrame, producing redundant information. We can restrict the output columns by slicing before grouping." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
18802000
18811935
18822127
18832084
18842297
......
201233715
201333253
201433206
201533063
201632868
\n", "

137 rows × 1 columns

\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 2000\n", "1881 1935\n", "1882 2127\n", "1883 2084\n", "1884 2297\n", "... ...\n", "2012 33715\n", "2013 33253\n", "2014 33206\n", "2015 33063\n", "2016 32868\n", "\n", "[137 rows x 1 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "year_rows = baby[['Year', 'Count']].groupby('Year').agg(len)\n", "year_rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the index of the resulting DataFrame now contains the unique years, so we can slice subsets of years using `.loc` as before:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
18802000
19003730
192010755
19408961
196011924
198019440
200029764
\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 2000\n", "1900 3730\n", "1920 10755\n", "1940 8961\n", "1960 11924\n", "1980 19440\n", "2000 29764" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Every twentieth year starting at 1880\n", "year_rows.loc[1880:2016:20, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping on Multiple Columns\n", "
\n", "\n", "We can group on multiple columns to get groups based on unique pairs of values. To do this, pass in a list of column labels into `.groupby()`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
YearSex
1880F90992
M110491
1881F91953
M100743
1882F107847
.........
2014M1913434
2015F1776538
M1907211
2016F1756647
M1880674
\n", "

274 rows × 1 columns

\n", "
" ], "text/plain": [ " Count\n", "Year Sex \n", "1880 F 90992\n", " M 110491\n", "1881 F 91953\n", " M 100743\n", "1882 F 107847\n", "... ...\n", "2014 M 1913434\n", "2015 F 1776538\n", " M 1907211\n", "2016 F 1756647\n", " M 1880674\n", "\n", "[274 rows x 1 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_counts = baby.groupby(['Year', 'Sex']).sum()\n", "grouped_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above computes the total number of babies born for each year and sex. We can now use grouping by multiple columns to compute the most popular names for each year and sex. Since the data are already sorted in descending order of Count for each year and sex, we can define an aggregation function that returns the first value in each series. (If the data weren't sorted, we can call `sort_values()` first.)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCount
YearSex
1880FMary7065
MJohn9655
1881FMary6919
MJohn8769
1882FMary8148
............
2014MNoah19263
2015FEmma20415
MNoah19594
2016FEmma19414
MNoah19015
\n", "

274 rows × 2 columns

\n", "
" ], "text/plain": [ " Name Count\n", "Year Sex \n", "1880 F Mary 7065\n", " M John 9655\n", "1881 F Mary 6919\n", " M John 8769\n", "1882 F Mary 8148\n", "... ... ...\n", "2014 M Noah 19263\n", "2015 F Emma 20415\n", " M Noah 19594\n", "2016 F Emma 19414\n", " M Noah 19015\n", "\n", "[274 rows x 2 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The most popular name is simply the first one that appears in the series\n", "def most_popular(series):\n", " return series.iloc[0]\n", "\n", "baby_pop = baby.groupby(['Year', 'Sex']).agg(most_popular)\n", "baby_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that grouping by multiple columns results in multiple labels for each row. This is called a \"multilevel index\" and is tricky to work with. The important thing to know is that `.loc` takes in a tuple for the row index instead of a single value:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Emily'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby_pop.loc[(2000, 'F'), 'Name']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But `.iloc` behaves the same as usual since it uses indices instead of labels:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCount
YearSex
1885FMary9128
MJohn8756
1886FMary9889
MJohn9026
1887FMary9888
\n", "
" ], "text/plain": [ " Name Count\n", "Year Sex \n", "1885 F Mary 9128\n", " M John 8756\n", "1886 F Mary 9889\n", " M John 9026\n", "1887 F Mary 9888" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby_pop.iloc[10:15, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivoting\n", "
\n", "\n", "**If you group by two columns, you can often use pivot to present your data in a more convenient format.** Using a pivot lets you use one set of grouped labels as the columns of the resulting table.\n", "\n", "To pivot, use the `pd.pivot_table()` function." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SexFM
Year
1880MaryJohn
1881MaryJohn
1882MaryJohn
1883MaryJohn
1884MaryJohn
.........
2012SophiaJacob
2013SophiaNoah
2014EmmaNoah
2015EmmaNoah
2016EmmaNoah
\n", "

137 rows × 2 columns

\n", "
" ], "text/plain": [ "Sex F M\n", "Year \n", "1880 Mary John\n", "1881 Mary John\n", "1882 Mary John\n", "1883 Mary John\n", "1884 Mary John\n", "... ... ...\n", "2012 Sophia Jacob\n", "2013 Sophia Noah\n", "2014 Emma Noah\n", "2015 Emma Noah\n", "2016 Emma Noah\n", "\n", "[137 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(baby,\n", " index='Year', # Index for rows\n", " columns='Sex', # Columns\n", " values='Name', # Values in table\n", " aggfunc=most_popular) # Aggregation function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare this result to the `baby_pop` table that we computed using `.groupby()`. We can see that the `Sex` index in `baby_pop` became the columns of the pivot table." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCount
YearSex
1880FMary7065
MJohn9655
1881FMary6919
............
2015MNoah19594
2016FEmma19414
MNoah19015
\n", "

274 rows × 2 columns

\n", "
" ], "text/plain": [ " Name Count\n", "Year Sex \n", "1880 F Mary 7065\n", " M John 9655\n", "1881 F Mary 6919\n", "... ... ...\n", "2015 M Noah 19594\n", "2016 F Emma 19414\n", " M Noah 19015\n", "\n", "[274 rows x 2 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby_pop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "
\n", "\n", "We now have the most popular baby names for each sex and year in our dataset and learned to express the following operations in `pandas`:\n", "\n", "| Operation | `pandas` |\n", "| --------- | ------- |\n", "| Group | `df.groupby(label)` |\n", "| Group by multiple columns | `df.groupby([label1, label2])` |\n", "| Group and aggregate | `df.groupby(label).agg(func)` |\n", "| Pivot | `pd.pivot_table()` |" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }